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Abstract: Randomized sampling has recently been proven a highly efficient tech¬ 
nique for computing approximate factorizations of matrices that have low numeri¬ 
cal rank. This paper describes an extension of such techniques to a wider class of 
matrices that are not themselves rank-deficient, but have off-diagonal blocks that 
are; specifically, fhe classes of so called Hierarchically Off-Diagonal Low Rank 
(HODLR) mafrices and Hierarchically Block Separable (HBS) mafrices (a.k.a. “Hi¬ 
erarchically Semi-Separable (HSS)” mafrices). Such mafrices arise frequenfly in 
numerical analysis and signal processing, in particular in fhe consfrucfion of fasf 
mefhods for solving differential and infegral equations numerically. These sfruc- 
fures admif algebraic operafions (mafrix-vecfor mulfiplicafions, mafrix facloriza- 
fions, mafrix inversion, efc.) fo be performed very rapidly; buf only once a dafa- 
sparse represenfafion of fhe mafrix has been consfrucfed. How fo rapidly compute 
fhis represenfafion in fhe firsf place is much less well undersfood. The presenf paper 
demonsfrafes fhaf if an iV x mafrix can be applied fo a vecfor in 0{N) lime, and 
if fhe ranks of fhe off-diagonal blocks are bounded by an infeger k, fhen fhe cosl 
for consfrucling a HODLR represenfafion is 0{k‘^ N {log N)'^), and fhe cosl for 
constructing an HBS representation is 0{k‘^ N log A^) (assuming of course, that 
the matrix is compressible in the respective format). The point is that when legacy 
codes (based on, e.g., the Fast Multipole Method) can be used for the fast matrix- 
vector multiply, the proposed algorithm can be used to obtain the data-sparse rep¬ 
resentation of the matrix, and then well-established techniques for HODLR/HBS 
matrices can be used to invert or factor the matrix. The proposed scheme is also 
useful in simplifying the implementation of certain operations on rank-structured 
matrices such as the matrix-matrix multiplication, low-rank update, addition, etc. 

1. Introduction 

A ubiquitous task in computational science is to rapidly perform linear algebraic operations involving very 
large matrices. Such operations typically exploit special “structure” in the matrix since the costs of standard 
techniques tend to scale prohibitively fast with matrix size; for a general N x N matrix, it costs 0{N‘^) 
operations to perform a matrix-vector multiplication, 0{N^) operations to perform Gaussian elimination or to 
invert the matrix, etc. A well-known form of “structure” in a matrix is sparsity. When at most a few entries 
in each row of the matrix are non-zero (as is the case, e.g., for matrices arising upon the discretization of 
differential equations, or representing the link structure of the World Wide Web) matrix-vector multiplications 
can be performed in O(A^) operations instead of 0{N‘^). The description “data-sparse” applies to a matrix that 
may be dense, but that shares the key characteristic of a sparse matrix that some linear algebraic operations, 
typically the matrix-vector multiplication, can to high precision be executed in fewer than 0{N‘^) operations 
(often in close to linear time). 

Several different formats for rank-structured matrices have been proposed in the literature. In this manuscript, 
we rely on the so called Hierarchically Off-Diagonal Low Rank (HODLR) format. This name was minted in ||T1 , 
but this class of matrices has a long history. It is a special case of the 2f-matrix format introduced by Hackbusch 
and co-workers |[T9l [3l. and was used explicitly in Sec. 4]. The HODLR format is very easy to describe and 
easy to use, but can lead to less than optimal performance due to the fact that the basis matrices used to represent 
large blocks are stored explicitly, leading to a 0{k N log(A^)) storage requirement for a HODLR matrix whose 


off-diagonal blocks have rank at most k. To attain linear storage requirements and arithmetic operations, one 
can switch to a format that expresses all basis matrices hierarchically, in other words, the basis matrices used 
on one level are expressed implicitly in terms of the basis matrices on the next finer level. We sometimes say 
that we use nested basis matrices. To be precise, we use the Hierarchically Block Separable (HBS) format that 
was described in ll2^ [T5l . This format is closely related to the Hierarchically Semi-Separable (HSS) iTTl 1^ 
format, and is also related to the Tf^-matrix format ll^ ldll. 

The most straight-forward technique for computing a data-sparse representation of a rank-structured N x N 
matrix A is to explicitly form all matrix elements, and then to compress the off-diagonal blocks using, e.g., the 
SVD. This approach can be executed stably 1351 |20l, but it is often prohibitively expensive, with an 0{kN‘^) 
asymptotic cost, where k is the rank of the off-diagonal blocks (in the HSS-sense). Fortunately, there exist for 
specific applicafions much fasfer mefhods for construcfing HSS represenfafions. When fhe mafrix A approxi- 
mafes a boundary infegral operafor in fhe plane, fhe fechnique of 1281 compufes a represenfafion in 0{k^ N) 
fime by exploifing represenfafion resulfs from pofenfial fheory. In ofher environmenfs, if is possible fo use 
known regularify properties of fhe off-diagonal blocks in conjunction wifh inferpolafion fechniques fo obfain 
rough inifial facforizafions, and fhen recompress fhese fo obfain facforizafions wifh close fo opfimal ranks 
in 123 . A particularly popular version of fhe “regularify - 1 - recompression” mefhod is fhe so called Adaptive 
Cross Approximation fechnique which was inifially proposed for Tf-mafrices m |5] |23l buf has recenfly been 
modified fo obfain a represenfafion of a mafrix in a formaf similar fo fhe HSS 1T2]| . 

The purpose of fhe presenf paper is fo describe a fasf and simple randomized fechnique for compufing a dafa 
sparse represenfafion of a rank-sfrucfured mafrix which can rapidly be applied fo a vecfor. The exisfence of 
such a fechnique means fhaf fhe advanfages of fhe HODLR and HBS formafs — fasf inversion and facforizafion 
algorifhms in particular — become available for any mafrix fhaf can currenfly be applied via fhe FMM, via 
an Tf-mafrix calculafion, or by any ofher exisfing dafa-sparse formaf (provided of course fhaf fhe mafrix is in 
principle rank-sfrucfured). In order fo describe fhe cosf of fhe algorifhm precisely, we infroduce some nofafion: 
We lef A be an X iV mafrix whose off-diagonal blocks have maximal rank k, we lef Tmult denofe fhe time 
required to perform a matrix-vector multiplication x 1 —;■ A x or x 1 —)■ A* x, we let Tj-and denote the cost of 
constructing a pseudo random number from a normalized Gaussian distribution, and Tfjop denote the cost of a 
floating point operation. The computational cost Ttotai of the algorithm for the HBS format then satisfies 

Ttotal ~ Tmult X k log(iV) -f Trand X {kp) N log(iV) -f Tflop X k"^ N log(iV), (1) 

where p is a funing paramefer fhaf balances compufafional cosf againsf fhe probabilify of nof meeting fhe 
requesfed accuracy. Setting p = 10 is oflen a good choice which leads fo a “failure probabilify” of less fhaf 
10“®, see Remarkj^ In particular, if Tmuit is 0{N), fhen fhe mefhod presenfed here has overall complexify 
0{k'^ N log(A')). For fhe HODLR formaf, an additional factor of log N arises, cf. ( p^ . 

The work presenfed is direcfly inspired by ll27l (which is based on a 2008 preprinf 1261), which presenfed a 
similar algorifhm wifh 0{k^ N) complexify for fhe compression of an HBS mafrix. This is better by a facfor 
of log(A^) compared fo fhe presenf work, buf fhe algorifhm of ll27l has a serious limifafion in fhaf if requires 
fhe abilify fo evaluafe 0{k N) enfries of fhe mafrix fo be compressed. This algorifhm was lafer refined by Xia 
Il33l and applied fo fhe fask of accelerafed fhe “nesfed dissection” direcf solver |[T3l [TTI for ellipfic PDFs fo 
0{N) complexify. In 2011, by L. Lin, J. Lu, and L. Ying presented an alternative algorifhm ll25l fhaf inferacfs 
wifh fhe mafrix only via fhe mafrix-vecfor mulfiplicafion, which makes fhe randomized compression idea much 
more broadly applicable fhan fhe algorifhm in ETl . However, fhis came af fhe cosf of requiring 0{k log(Af)) 
mafrix-vecfor mulfiplies (jusf like fhe presenf work). The algorifhm proposed here is an evolution of ||251 fhaf 
does away wifh a sfep of leasf-square tiffing fhaf led fo a magnificafion of fhe sampling error resulfing from fhe 
randomized approximafion. Moreover, we here presenf a new sfrafegy for enforcing fhe “nesfedness” of fhe 
basis mafrices fhaf is required in fhe HBS formaf. 

Remark 1. The technique described in this paper utilizes a method for computing approximate low-rank factor¬ 
izations of matrices that is based on randomized sampling As a consequence, there is in principle 


a non-zero risk that the method may fail to produce full accuracy in any given realization of the algorithm. This 
risk can be controlled by the user via the choice of the tuning parameter p in 0. for details see Section 
Moreover, unlike some better known randomized algorithms such as Monte Carlo, the accuracy of the output of 
the algorithms under discussion here is typically very high; in the environment described in the present paper, 
approximation errors of less than 10“^*^ are entirely typical. 


2.5 


2. Preliminaries 


2.1. Notation. Throughout the paper, we measure vectors in using their Euclidean norm. The default norm 
for matrices will be the corresponding operator norm || A|| = sup||x||=i || Ax||, although we will sometimes also 

use the Frobenius norm II Allpro = (j I j 


We use the notation of Golub and Van Loan liThll to specify submatrices. In other words, if B is an m x n matrix 
with entries bij, and I = [ii, i 2 , ..., ik] and J = [ji, j 2 , ..., ji] are two index vectors, then we let B(/, J) 
denote the k x i matrix 


B(/,J) 


^hji ^hj2 ■ ■ ■ ^hje 

bi2jl bi2j2 ■ ■ ■ ^i2jt 

^ikjl ^ikj2 ■ ■ ■ ^ikje 


We let B(/, :) denote the matrix B(/, [1,2,..., n]), and define B(:, J) analogously. 


The transpose of B is denoted B*, and we say that a matrix U is orthonormal if its columns form an orthonormal 
set, so that U* U = /. 


2.2. The QR factorization. Any m x n matrix A admits a QR factorization of the form 


A P = Q R, 

m X n n X n m x r r x n 


( 2 ) 


where r = min(m, n), Q is orthonormal, R is upper triangular, and P is a permutation matrix. The permutation 
matrix P can more efficiently be represented via a vector G Z” of column indices such that P = l(:, Jf) 
where I is the n x n identity matrix. As a result, the factorization can be written as: 


A(:,J,) = Q R, 
m X n m X r r x n 

The QR-factorization is often built incrementally via a greedy algorithm such as column pivoted Gram-Schmidt. 
This allows one to stop after the first k terms have been computed to obtain a “partial QR-factorization of A”: 

A(:,Jc) = Q(2)] 

m X n m X r 

That is, taking the first k columns of Q and the first k rows of R, we can obtain the approximation: 

A(:,J,)«QfcRfc (3) 

We note that the partial factors and R^ can be obtained after k steps of the pivoted QR algorithm, without 
having to compute the full matrices Q and R. 


■r(i)- 
r(2)J ’ 

r X n 







2.3. The singular value decomposition (SVD). Let A denote an m x n matrix, and set r 
A admits a factorization 

A = U E V*, 

m X n m X r r x r r x n 


min(m, n). Then 


(4) 


where the matrices U and V are orthonormal, and E is diagonal. We let and denote the 

columns of U and V, respectively. These vectors are the left and right singular vectors of A. The diagonal 
elements {aj}j^i of E are the singular values of A. We order these so that cri > cr 2 > • • • > > 0. We let 

Afc denote the truncation of the SVD to its first k terms, so that A^ = Yl’j=i ^*j- easily verified fhat 


Afc llspectrai — ^k+ii and thaf 


lA-A 


k \ \ Fro — 


/ ■ ! \ \ 1/2 

E 

j=k+l 


where || • || spectral denotes the operator norm and || • Upro denotes the Frobenius norm. Moreover, the Eckart- 
Young theorem states that these errors are the smallest possible errors that can be incurred when approximating 
A by a matrix of rank k. 


2.4. The interpolatory decomposition (ID). Let A denote an m x n matrix of rank k. Then A admits the 
factorization 

A = A(:,J) X, 

m X n m X k k x n 

where J is a vector of indices marking k of the columns of A, and the k x n matrix X has the k x k identity 
matrix as a submatrix and has the property that all its entries are bounded by 1 in magnitude. In other words, 
the interpolative decomposition picks k columns of A as a basis for the column space of A and expresses the 
remaining columns in terms of the chosen ones. The ID can be viewed as a modification to so the called Rank- 
Revealing QR factorization Q. It can be computed in a stable and accurate manner using the techniques of 
OH, as described in O. (Practical algorithms for computing the interpolative decomposition produce a matrix 
X whose elements slightly exceed 1 in magnitude.) 


2.5. Randomized compression. Let A be a given m x n matrix that can accurately be approximated by a 
matrix of rank k, and suppose that we seek to determine a matrix Q with orthonormal columns (as few as 
possible) such that 

||A-QQ*A|| 

is small. In other words, we seek a matrix Q whose columns form an approximate orthornomal basis (ON-basis) 
for the column space of A. This task can efficiently be solved via the following randomized procedure: 

(1) Pick a small integer p representing how much “over-sampling” we do. (p = 10 is often good.) 

(2) Lorm an n x {k p) matrix Q whose entries are iid normalized Gaussian random numbers. 

(3) Lorm the product Y = A fi. 

(4) Construct a matrix Q whose columns form an ON-basis for the columns of Y. 


Note that each column of the “sample” matrix Y is a random linear combination of the columns of A. We 
would therefore expect the algorithm described to have a high probability of producing an accurate result when 
p is a large number. It is perhaps less obvious that this probability depends only on p (not on m or n, or any 
other properties of A), and that it approaches 1 extremely rapidly as p increases. In fact, one can show that the 
basis Q determined by the scheme above satisfies 


A-QQ*A|| < 


1 -h ll^/k + p • min{m, n} 


o'fc-i-i, 


(5) 


with probability at least 1 — 6 • p see Il22l Sec. 1.5]. The error bound (j^ indicates that the error produced 
by the randomized sampling procedure can be larger than the theoretically minimal error <Tfc_|_i by a factor of 






1 + ll\/k + p ■ Y^min{m, n}. This crude bound is typically very pessimistic; for specific situations sharper 
bounds have been proved, see fill . 

Definition 1. Let ^ be an m x n matrix, and let e be a positive number. ITe say that an m x i matrix Y an 
e-spanning matrix for A, if \\ A — YY^A|| < e. Informally, this means that the columns ofY span the column 
space A to within precision e. Furthermore, we say that an m x i matrix Q is an e-basis matrix for A if its 
columns are orthonornial, and if\\A — QQ*A|| < e. 

2.6. Functions for low-rank factorizations. For future reference, we introduce two functions “qr” and 
“svd” that can operate in three difference modes. In the first mode, they produce the full (“economy size”) 
factorizations described in section [T2| and [23j respectively, 

[Q,R, J] = qr(A), [U,D,V] = svd(A), and [X,J] = id(A). 

In practice, we execute these factorizations using standard LAPACK library functions. In the second mode, we 
provide an integer k and obtain partial factorizations of rank k, 

[Q, R, J] = qr(A, k), [U, D, V] = svd(A, fc), and [X, J] = id(A, fc). 

Then the matrices Q, U, D, V have precisely k columns, and R has precisely k rows. In the third mode, we 

provide a real number e G (0,1) and obtain partial factorizations 

[Q, R, J] = qr(A, e), [U, D, V] = svd(A, e), and [X, J] = id(A, e), 

such that 

||A(: , J) - QR|| < e ||A - UDV*|| < e, and ||A - A(:, J)X|| < e. 

In practice, for a small input matrix A, we execute mode 2 and mode 3 by calling the LAPACK routine for a 
full factorization (the ID can be obtained from the full QR), and then simply truncating the result. If A is large, 
then we use the randomized sampling technique of Section [23] 

Remark 2. The differentiation between modes 2 and 3 for qr and svd is communicated by whether the second 
argument is an integer (mode 2) or a real number e G (0,1) (mode 3). This is slightly questionable notation, 
but it keeps the formulas clean, and hopefully does not cause confusion. 

2.1. A binary tree structure. Both the HODLR and the HBS representations of an M x M matrix A are 
based on a partition of the index vector I = [1, 2, ..., M] into a binary tree structure. We let I form the root 
of the tree, and give it the index 1, Ii = I. We next split the root into two roughly equi-sized vectors I 2 and I 3 
so that /i = /2 U Is. The full tree is then formed by continuing to subdivide any interval that holds more than 
some preset fixed number m of indices. We use the integers = 0, 1, ..., L to label the different levels, with 
0 denoting the coarsest level. A leaf is a node corresponding to a vector that never got split. For a non-leaf 
node r, its children are the two boxes ai and ct 2 such that U 1 ^ 2 , and r is then the parent of cji and 

CJ 2 . Two boxes with the same parent are called siblings. These definitions are illustrated in Figure [T] For any 
node T, let Ur denote the number of indices in Ir. 

2.8. The HODLR data sparse matrix format. The Hierarchically Off-Diagonal Low Rank (HODLR) prop¬ 
erty is, as the name implies, a condition that the off-diagonal blocks of a matrix A should have low (numerical) 
rank. To be precise, given a hierarchical partitioning of the index vector, cf. Section [2.7] a computational 
tolerance e, and a bound on the rank k, we require that for any sibling pair {a, (3}, the corresponding block 

^a,l3 — A(/q,, Ljs) 

should have e-rank at most k. The tessellation resulting from the tree in Figure[T]is shown in Figure]^ We then 
represent each off-diagonal block via a rank-k factorization 

Ua X Ua X k k X k k X njj 





Figure 1. Numbering of nodes in a fully populated binary tree with L = 3 levels. The root 
is the original index vector / = Ii = [1, 2, ..., 400]. 



Figure 2. A HODLR matrix A tesselated in accordance with the tree in Figure [T] Every 
off-diagonal block that is marked in the figure should have e-rank at most k. 


where tla and B 13 are orthonormal matrices. It is easily verified fhaf if we required each leaf node fo have 
af mosf 0{k) nodes, fhen if fakes 0{kN logN) storage fo store all factors required to represenf A, and a 
mafrix-vector mulfiplicafion can be executed using 0{k N log N) flops. 

For fulure reference, we define for a given HODLR mafrix A a “level-fruncafed” mafrix A^^^ as the matrix 
obtained by zeroing out any block associated with levels finer than i. In other words. 


A(i) = 


0 

A2,3 

A3,2 

0 


A(2) = 


0 

A4,5 

A2,3 

A5,4 

0 

> 

CO 

0 

Ae.T 

At, 6 

0 


etc. 


Remark 3. In our description of rank-structured matrices in sections 2.8 and 2.9 we generally assume that 


the numerical rank is the same number k for every off-diagonal block. In practice, we typically estimate the 
e-rank for any specific off-diagonal block adaptively to save both storage and flop counts. 






























2.9. The HBS data sparse matrix format. The HODLR format is simple to describe and to use, but is slightly 
inefficient in that it requires the user to store for every node r, the basis matrices Ur and Vr, which can be quite 
long. The Hierarchically Block Separable (HBS) format is designed to overcome this problem by expressing 
these matrices hierarchically. To be precise, suppose that r is a node which children {a, /3}, and that we can 
find a short matrix Ur such that 


Ur 

Ur X k 


Ua 0 I I 

0 Up \ 

Ur X “^k 2k X k 


( 6 ) 


The point is that if we have the long basis matrices Ua and Up available for the children, then all we need to 
store in order to be able to apply Ur is the short matrix Ur. This process can now be continued recursively. 
For instance, if { 7 , <5} are the children of a, and { 0 , p} are the children of (3, we assume there exist matrices 
Uq, and U^ such that 


Ua = 

■ U^ 

0 

1 - 

U 

a? 

and 

Up 

= 

1 _ 

Ua X k 

Ha 

X 2k 

2k 

X k 



np X k 


np X 2k 

Then Ur can be expressed as 












■ U.y 

0 

0 

0 





Ur 

= 

0 

0 

Us 

0 

0 

U^ 

0 

0 



0 

U/3 . 

Ur- 




0 

0 

0 

u^ 




n 

r X k 



Ur 

X 4k 


4k X 

2k 

2k X k 


By continuing this process down to the leaves, it becomes clear that we only need to store the “long” basis 
matrices for a leaf node (and they are not in fact long for a leaf node!); for every other node, it is sufficient to 
store the small matrix Ur. The process for storing the long basis matrices Vr via small matrices Vr of size 
2k X k is of course exactly analogous. 

For a general HODLR matrix, there is no guarantee that a relationship such as Q need hold. We need to 
impose an additional condition on the long basis matrices Ur and Vr- To this end, given a node r, let us 
define a neutered row block as the off-diagonal block /K{Ir, I!)), where K) is the complement of Ir within the 
vector [1, 2, 3,..., N], cf. Figure]^ We then require that the columns of the long basis matrix Ur must span 
the columns of A(/r, 1^)- Observe that for a node r with sibling a, the sibling matrix A(Ir, 1^) is a submatrix 
of the neutered row block A(/r, 1^) since C This means that the new requirement of the basis matrices 
is more restrictive, and that typically the ranks required will be larger for any given precision. However, once 
the long basis matrices satisfy the more restrictive requirement, it is necessarily the case that Q holds for some 
small matrix U,-. 


We analogously define the neutered column block for r as the matrix A(/);, Ir) and require that the columns of 
Vr span the rows of A(/):, Ir). 


Definition 2. We say that a HODLR matrix A is an HBS matrix if, for every parent node r with children {a, (3}, 
there exist “small” basis matrices Ur and Vr such that 


Ur 

rir X k 


1_ 

0 

Up 

Ur, 

and 

Vr = 


0 

Vp 

Ur 

X 2k 

2k X k 


Ur X k 

Ur 

X 2k 


While standard practice is to require all basis matrices Ur and Vr to be orthonormal, we have found that 
it is highly convenient to use the interpolatory decomposition (ID) to represent the off-diagonal blocks. The 
key advantage is that then the sibling interaction matrices which be submatrices of the original matrix. This 
improves interpretability, and also slightly reduces storage requirements. 
















u 


h 


(a) (b) 

Figure 3. Illustration of the neutered row blocks for the nodes 4 and 5, with parent 2. (a) The 
bloek A(l 4 , Jl) is marked in grey. Observe that A(/ 4 ,/|) = [A(/ 4 , Is), A(/ 4 , 12 )]. (b) The 
bloek A(/5, Jg) is marked in grey. Observe that A(l5, /|) = [A(/5,14), A(/5,12)]. 

Definition 3. We say that a HBS matrix A is in HBS-ID format if every basis matrix U,- and Vt- contains a 
k X k identity matrix, and every siblin interaction matrix submatrix of A. In other words, there exist 

some index sets <^nd such that 

^a,l3 = A(/q,, I^). 

The index sets and Ir are called the row skeleton and column skeleton of box t, respectively. We enforce 
that these are “nested,” which is to say that if the children of t are {a, /?}, then 

It da^ Ip Cind It ^ la^ Ip- 

Remark 4. The straight-forward way to build a basis matrix lAr for a node r is to explicitly form the cor¬ 
responding neutered row block A(I. ,Ilf) and then compress it (perform column pivtoed Gram-Schmidt on 
its columns to form an ON basis tlr, or perform column pivoted Gram-Schmidt on its rows to form the ID). 
However, suppose that we can somehow construct a smaller matrix Yr of size rir x i with the property that 
the columns ofY^ span the columns of ^(Ir, If). Then it would be sufficient to process the columns ofY^ 
to build a basis for A{It, If)- For instance, if we orthonormalize the columns ofYr to form a basis matrix 
tlr = qriYr), then the columns oflAr will necessarily form an orthonormal basis for the columns of^{Ir, If). 
The key point here is that one can often find such a matrix Y^ with a small number (. of columns. In ll28ll we use 
a representation theorem from potential theory to find such a matrix Y^ when A comes from the discretization 
of a boundary integral equation of mathematical physics. In Il27t we do this via randomized sampling, so that 
Yt- = /^{It, If Ft for some Gaussian random matrix Q.. The main point of II27II is that this can be done by 
applying all of ^ to a single random matrix with N x i columns, where i ^ k. In the current manuscript, we 
use a similar strategy, but we now require the application of ^ to a set ofO{k) random vector for each level. 

3. An algorithm for compressing a HODLR matrix 

The algorithm consists of a single sweep through the levels in the tree of nodes, starting from the root (the 
entire domain), and processing each level of successively smaller blocks at a time. 

In presenting the algorithm, we let G,- denote a random matrix of size n,- x r drawn from a Gaussian distribution. 
Blocked matrices are drawn with the blocks to be processed set in red type. To minimize clutter, blocks of the 
matrix that will not play a part in the current step are marked with a star (“*”) and may or may not be zero. 







































Processing level 0 (the root of the tree): Let {a,/3} denote the children of the root node (in our standard 
ordering, a = 2 and /3 = 3). Our objective is now to find low-rank factorizations of the off-diagonal blocks 


A = 


* 


^a/3 

* 


To this end, we build two random matrix, each of size N x r, and defined by 


«! = 


and 


Q 


2 — 


0 

G/, 


Then consfrucf fhe mafrices of samples 


Yi = An2 


^ 0 / 3^13 

* 


and Ya = Afii 


* 

A/3aGci 


Supporfed by fhe resulfs on randomized sampling described in Secfion [23| we now know fhaf if is almosf cerfain 
fhaf fhe columns in fhe lop block of Yi will span fhe columns of f^ai3- By orfhonormalizing fhe columns of 
Yi(/q, :), we Iherefore obfain an ON-basis for fhe column space of A^^. In ofher words, we sef 


Ua = qr(Yi(/„,:)), and Up = qr(Y 2 (/^,:)), 


and fhen we know fhaf Ua and Up will serve as fhe relevanf basis mafrices in fhe HODLR represenlalion of A. 
To compufe Va, Vp, and we now need lo form fhe mafrices U*^fKap and U*pf^pa- To do Ihis Ihrough 
our “black-box” mafrix-mafrix mulfiplier, we form fhe new fesf mafrices 


«! = 

Then consfrucf fhe mafrices of samples 
Zi = A*n2 = 


Ua 

0 




and 


and 


All fhaf remains is now lo compufe QR faclorizalions 


n 


2 — 


0 

Up 


Za = A*ni = 


KpUa 


[Va, = qr(Zi(/„,:)), and [V/?, Ba^j] = qr(Za(// 3 ,:)). 


(V) 


Remark 5. At a slight increase in cost, one can obtain diagonal sibling interaction matrices ^ap tind B^q. We 
would then replace the QR factorization in ^ by a full SVD 

[Va, ^Pa, U/3] = svd{Zi{Ia, :)), and [Vp, ^ap, Ua] = svd{Z 2 {Ip, :)). (8) 

This step then requires an update to the long basis matrices for the column space: 

U-a ^— Ua^a and Up i — Up^p. (9) 


Processing level 1: Now fhaf all off-diagonal blocks on level 1 have been compuled, we can use Ibis informafion 
fo compress fhe blocks on level 2. We lei {a, /3, 7 , <5} denole fhe boxes on level 2 (in slandard ordering, a = 4, 
/3 = 5, 7 = 6 , (5 = 7). Our objective is now lo consfrucf low-rank approximafions fo fhe blocks marked in red: 


Firsl, observe fhaf 


* 

Aa/3 


A/3a 

* 



A-y^ 


* 


* 

Aa/3 

0 

A/3a 

* 

0 


A-y^ 

Af^'y 

* 


A - A(^) 


































We then define two random test matrices, each of size N x r, via 



Go 



0 

«! = 

0 

G, 

and 

^2 = 

G/3 

0 


0 





We compute the sample matrices via 



AoygG/? 



* 

Yi = Afi2-A(i)n2 = 

* 

* 

and 

Y 2 = AJ2i - = 

* 

A(5'y G.^ 


In evaluating Yi and Y 2 , we use the black-box multiplier to form /KQ 2 and Af2i, and the compressed repre¬ 
sentation of obtained on the previous level to evaluate A*^^^n2 and A*^^^f2i. We now obtain orthonormal 
bases for the column spaces of the four sibling interaction matrices by orthonormalizing the pertinent blocks of 
Yi and Y 2 : 


W„ = qr(Yi(4,:)), = qr(Y 2 (/^,:)), = qr(Yi(/^,:)), Us = qr{Y 2 ( 15 , :))■ 

It remains to construct the ON-bases for the corresponding row spaces and the compressed sibling interaction 
matrices. To this end, we form two new test matrices, both of size N x r, via 



■ Ua ' 



0 

fit = 

0 

u^ 

and 

^2 = 

Up 

0 


0 



Us 


Then the sample matrices are computed via 


Zi = A*fi2 - (A(i^)*n2 = 


* 


and 


Z 2 = A*ni - = 






We obtain diagonal compressed sibling interaction matrices by taking a sequence of dense SVDs of the relevant 
sub-blocks, cf. 


[Va, Uy 3 ] = svd(Zi(/o,:)), 

[V/ 3 , Ba/3, U„] = svd(Z 2 (// 3 ,:)), 

[V^, 0^] = svd(Zi(/.^,:)), 

[V 5 , B.^ 5 , U.y] = svd(Z2(/5,:)). 

Finally update the bases for the column-spaces, cf Q, 

Ua ^— U(yiSa^ U.jS ^— U-jS^fS-} Uj i — U^ij'y^ Us ^— Us^S' 


Processing levels 2 through L — 1.- The processing of every level proceeds in a manner completely analogous 
to the processing of level 1. The relevant formulas are given in Figure]^ 

Processing the leaves: Once all L levels have been traversed using the procedure described, compressed rep¬ 
resentations of all off-diagonal blocks will have been computed. At this point, all that remains is to extract the 


















diagonal blocks. We illustrate the process for a simplistic example of a tree with only L = 2 levels (beyond the 
root). Letting {a, /?, 7 , (5} denote the leaf nodes, our task is then to extract the blocks marked in red: 



Da 

0 

0 

0 

D/3 

0 

D, 

0 

0 

Ds 


+ 


A(2). 


Since the diagonal blocks are not rank-deficient, we will extract them directly, without using randomized sam¬ 
pling. To describe the process, we assume at first (for simplicity) that every diagonal block has the same size, 
m X m. We then choose a test matrix of size N x m 


n 


and trivially extract the diagonal blocks via the sampling 


Y = Afi - A(2)f2 


L/q, 

D, 

Ds 


The diagonal blocks can then be read off directly from Y. 

For the general case where the leaves may be of different sizes, we form a test matrix Q of size N x m, where 
m = max{nT- : r is a leaf}, such that 


n{Ir ,:) = [ln.r zeros{nr,m - Ur)]. 


In other words, we simply pad a few zero columns at the end. 
The entire algorithm is summarized in Figure]^ 


3.1. Asymptotic complexity. Let L denote the number of levels in the tree. We find fhaf L ~ log N. Lef Tgop 
denote fhe fime required for a flop, lef Tmuit denote fhe fime required fo apply A or A* fo a vector, and lef T^{f) 
denote fhe lime required to apply A^^^ fo a veclor. Then fhe cosl of processing level i is 

, N 

Ti ~ T(nuit X k + Tflop X 2 k ^ 


since on level i Ihere are 2^ blocks to be processed, and each “long” malrix af Ihis level has heighl 2 ^ N and 
widfh 0{k). Furfher, we find fhaf fhe cosl of applying A^^^ to a single vector is 


E _[ \ _ 

2-^ /c — ~ Tflop X ikN. 

j=0 


Combining ([TT]) and ([T^ and summing from ^ = 0 fo ^ = L, we find (using fhaf L ~ log N) 


^compress 


r^,it X A: log )V + Tflop xk^N (log Nf. 


( 12 ) 


rsj 


(13) 




















Build compressed representations of all off-diagonal blocks. 
loop over levels i = D : (L — 1) 

Build the random matrices f2i and ^ 2 - 
= zeros(n, r) 

^2 = zeros(n, r) 
loop over boxes r on level I 

Let {a, /?} denote the children of box r. 

= randn(na,r) 

^ 2 ^/ 3 , 0 = randn(n/ 3 ,r) 
end loop 

Apply A to build the samples for the incoming basis matrices. 

Yi = Af22 - A^^a 
Ya = Afii - aWQi 

Orthonomialize the sample matrices to build the incoming basis matrices. 
loop over boxes r on level i 

Let {a, /?} denote the children of box r. 

W„ = qr(Yi(/„,:)). 

Up = qr{Y2{l0,:)). 

0 = 

fi2(//3,:)=W/3 

end loop 

Apply A* to build the samples for the outgoing basis matrices. 

Zi = A*n2- (A(^))*n2 
Za = A*ni - (AW)*ni 

Take local SVDs to build incoming basis matrices and sibling interaction matrices. 
We determine the actual rank, and update the W* basis matrices accordingly. 
loop over boxes r on level I 

Let {a, /?} denote the children of box r. 

[Va, 0/3] = svd(Zi(/Q,, :),e). 

[V/3, B„/3, Vq] = svd(Z2(//3,:), e). 

Up ■(— UpUp. 

Id ^ Id n B rv • 

end loop 
end loop 

Extract the diagonal matrices, 
nmax = iTiax {ur 1 T is a leaf} 
n = zeros(iV,nmax) 
loop over leaf boxes r 

^{It, 1 : Ur) = eyeiur). 
end loop 
Y = AfZ - 
loop over leaf boxes r 
Dr = Y(/r, 1 : Ur). 
end loop 


Figure 4. Randomized compression of a HODLR matrix. 



4. An algorithm for compressing an HBS matrix 


The algorithm for computing a compressed representation of an HBS matrix is a slight variation of the algorithm 
for a HODLR matrix described in Section The key difference is that the long basis matrices Ua and Va 
associated with any node now must satisfy a more rigorous requirement. We will accomplish this objective by 
constructing for every node a, two new “long” sampling matrices [Vq and Za, each of size Ua x r that help 
transmit information from the higher levels to the node a. The presentation will start in Section 4.1 with a 
description of the modification to the scheme of [^required to enforce the more rigorous requirement. In this 
initial description, we will assume that all four “long” matrices associated with a node (tlr, Vr, are 

stored explicitly, resulting in an 0{k N log A^) memory requirement, just like for the HODLR algorithm. In 
Section |4~2] we show that while these “long” matrices do need to be temporarily built and processed, they can be 
stored implicitly, which will allow the algorithm to use only 0{k N) memory. Finally, Section 4.3 will describe 
how to construct a representation using interpolatory decompositions in all low-rank approximations. 


4.1. A basic scheme for compressing an HBS matrix. Throughout this section, let a denote a node with a 
parent r that is not the root, and with a sibling j5. We will first describe the process for building the long basis 
matrices {Ut}t- To this end, recall that the difference in requirements on the long basis matrices in the two 
frameworks is as follows: 


HODLR framework: The columns of Ua need to span the columns of A(/q,, I^) 
HBS framework: The columns of Ua need to span the columns of /K{Ia,Ia)- 


The assertion that the requirements on a basis in the HBS framework is more stringent follows from the fact 
that Ip C /^, and that Ip is in general much smaller than /^. Now note that 

^{Ia, Pa) = [A(/„, Ip) A{Ia, /^)] P. (14) 

In ( [T4| ), the matrix P is a permutation matrix whose effect is to reorder the columns. For purposes of construct¬ 
ing a basis for the column space, the matrix P can be ignored. The idea is now to introduce a new sampling 
matrix 3 ^q, of size x r that encodes all the information that needs to be transmitted from the parent r. 
Specifically, we ask fhaf: 


The columns of span fhe columns of fii{Ia, If) (to wifhin precision e). 


Then, when processing box a, we will sample A(/a, Ip) using a Gaussian mafrix of size np x r jusf as in 
fhe HODLR algorifhm. In fhe end, we will build Ua by combining fhe fwo sefs of samples 


[W„, Dq,, ~] = svd( [A(/q,, Ip)Gp, 3^0], r). 


In ofher words, we lake a mafrix [A(/q,, Ip)Gp, 3 ^q,] of size x 2r and exlracl ils leading r singular com- 
ponenfs (fhe frailing r componenfs are discarded). All fhaf remains is fo build fhe sample mafrices and ys 
fhaf fransmif information fo fhe children {7, ti} of a. To be precise, lef and Js denote fhe local relative index 
vectors, so fhaf 

I-y — la ('^ 7 ) and F (5 — la i^Js)' 


Then sef 


3^7 — U{J^, :)Dq, 


and ys = U{J5,:)Da. 


The process for building fhe long basis mafrices {Va}a is entirely analogous fo fhe process described for 
building fhe { Uo}a mafrices. We firsf recall fhaf fhe difference befween fhen HODLR and fhe HBS frameworks 
is as follows: 


HODLR framework: The columns of Va need to span fhe rows of fii{Ip, la) 
HBS framework: The columns of Va need to span fhe rows of A(/^, /„). 




loop over levels £ = 0 : {L — 1) 

Build the random matrices Qi and ^ 2 - 
= zeros(n, r) 

5^2 = zeros(n, r) 
loop over boxes r on level i 

Let {a, /3} denote the children of box r. 
Qi{Ia,:) = randn(na,r) 

^2(1/3, ■) = randn(n/ 3 ,r) 
end loop 

Apply A to build the samples 
for the incoming basis matrices. 

Yi = Afia - 

Y 2 = Afii-A^^^Qi 

Orthonormalize the sample matrices to 
build the incoming basis matrices. 
loop over boxes t on level £ 

Let {a, /3} denote the children of box r. 
if (r is the root) 

Yl°==Yi (/<,,:) 

Y1°==Y2(/;3,:) 

else 

Y'r= [Yij;.(Ja,:)] 
Y^°"=[Y2(/;3,:), yr{Jp,-.)] 

end if 

= svd(Y']°‘=,r). 

[Z^/3,y/3,H = svd(Y 2 °^r). 
ya = Ua diag(y„). 
yp =Uf 3 diag(y^). 

^ 1(^05 0 = Ua 
^2{Ip, ■) — Up 

_ r U*aUr{Ja,-.) ' 

[u*pU,{Jpp.) _■ 

end loop 


Apply A* to build the samples 
for the outgoing basis matrices. 

Zi - ^*^2 - (A(^))*n2 
Z 2 = A*£2i - (A^^yni 

Take local SVDs to build incoming basis 
matrices and sibling interaction matrices. 
loop over boxes t on level £ 

Let {a, denote the children of box r. 
if (r is the root) 

Z'r = Zi(4, : ) 

= Z2(//3, : ) 

else 

Z'r = [Zi(/„, : ), Zr{Ja,-)\ 

Z^f^ = [Z2{Ip,-.), Z^iJpp.)] 

end if 

[Va,b2i,Xi] = svd(Z!|°‘',r). 

[V;3,bl2,X2] = Svd(Z^°^0• 

Za = Vadiag(b2i) 

Zp = V/3diag(bi2) 

Ba /3 = Xi(l : r, :)diag(bi2) 

B/ 3 a = X2(l : r, :)diag(b2i) 

_ r VlVriJa,--) ■ 

end loop 
end loop 

Extract the diagonal matrices, 
n-max = max {tIt : t is a leaf} 

n = zeros (A^,ninax) 
loop over leaf boxes r 

^{It, 1 : nr) = eye(nr)- 

end loop 

Y = An A^^^n 
loop over leaf boxes r 
D, = Y(/,,l:n,). 

end loop 


Figure 5. A basic scheme for compressing an HBS matrix. 

With P again denoting a permutation matrix, we have 

Atrc 7 \ _ p 

It follows that the role that was played by in the construction of is now played by a sampling matrix 
Za of size Ua X r that satisfies 

The columns of Za span the rows of A(/}:, la)- 
The algorithm for computing the HBS representation of a matrix is now given in Figure]^ 


4.2. A storage efficient scheme for compressing an HBS matrix. The scheme described in Section 4.1 


assumes that all “long” basis and spanning matrices {Ur-, Vt, 3 ^r! ^r) are stored explicitly, resulting in an 
0{k N log N) storage requirement. We will now demonstrate that in fact, only 0{k N) is required. 











First, observe that in the HBS framework, we only need to keep at hand the long basis matrices 11^ and Vt for 
nodes r on the level I that is currently being processed. In the HODLR compression algorithm in Figure we 
needed the long basis matrices associated with nodes on coarser levels in order to apply but in the HBS 
framework, all we need in order to apply is the long basis matrices {Ur-, Vt} on the level currently being 
processed, and then only the short basis matrices Ur and Mr for any box r on a level coarser than £, cf. the 
algorithm in Figure [T5| 


Next, observe that the long “spanning” matrices Vr and Zr that were introduced in Section 4.1 do not need to 
be stored explicitly either. The reason is that these matrices can be expressed in terms of the long basis matrices 
Ur and Vr- In fact, in the algorithm in Figure]^ we compute yr and Zr via the relations 


Vt = i^T diag(y,-) and .Zr = Vt ciiag(zT). 


Since the long basis matrices Ur and Vt are available during the processing of level i, we only need to store 
the short vectors and Zt, and can then construct Vt and Zr when they are actually needed. 


The memory efficient algorithm resulting from exploiting the observations described in this section is summa¬ 
rized in Figure]^ 


4.3. Adaptive rank determination and conversion to the HBS-ID format. The schemes presented in Sec¬ 
tions |4?^ and [4^ do not adaptively determine the ranks of the off-diagonal blocks being compressed. Instead, 
every block is factored using a preset uniform rank £ that must be picked to be larger than any actual numerical 
rank encountered. It is possible to incorporate adaptive rank determination in to the scheme, but we found it 
easier to perform this step in a second sweep that travels through the tree in the opposite direction: from smaller 
boxes to larger. In this second sweep, we also convert the standard HBS format to the HBS-ID format, which 
leads to a slight improvement in storage requirements, and improves interpretability of the sibling interaction 
matrices, as discussed in Section |T9] and Definition!^ 

The conversion to the HBS-ID is a “post-processing” step, so in what follows, we assume that the compression 
algorithm in Figurej^has already been executed so that both the HBS basis matrices Ur and Vt, and the sample 
matrix Vt and Zr are available for every node (these are stored implicitly in terms of the short basis matrices 
Ut and Mr, as described in Section |T^ . 

The first step is to sweep over all leaves r in the tree. For each leaf, we now have available spanning matrices 
Vt and Zr whose columns span the columns of A (It, I^) and /K{I{Ir)*, respectively. In order to find a set 
of spanning rows I™ of A(It,I.^) and a set of spanning columns I™* of A{I{Ir)*, ah we need to do is to 
compute interpolatory decompositions (IDs) of the small matrices Vt and Zr, cf. Remark|^ 

[Tin, Jin] = id(V},e), and [Tout, Jout] = id(Z;, e). (15) 

In equation ( [T5] ), we give the computational tolerance e as an input parameter. This reveals the “true” e-ranks 
fcin and lout- Then the skeleton index vectors for r are given by 

c = JT(Jin(l : fein)), and 1°"* = lT(Jout(l : feout))- 

Now define the subsampled basis matrices and via 

Usamp^U(Jin(l:fcin),0, and = V(Jout(l : A:out), O- ( 16 ) 

Once ah leaves have been processed in this manner, we can determine the sibling interaction matrices in the 
HBS-ID representation. Let {a,/3} denote a sibling pair consisting of two leaves. First observe that, by 
definition, 

Bj^^' = A(I“,I^"*). (17) 


Next, recall that 


A(/„,I^) = U„B„,;3V^. 


(18) 



loop over levels £ = 0 : (L — 1) 

Build the random matrices and Q.2- 
Qi = zeros(n, r) 

Q .2 = zeros(n, r) 
loop over boxes r on level £ 

Let {a, /3} denote the children of box t. 
Qi{Ia,:) = randn(na,r) 

^ 2 {Ip,') = randn(n/ 3 ,r) 
end loop 

Apply A to build the samples 
for the incoming basis matrices. 

Yi = Afia - 

Y 2 = AQi - 

Orthononnalize the sample matrices to 
build the incoming basis matrices. 
loop over boxes r on level ^ 

Let {a, /3} denote the children of box t. 
if (r is the root) 

Yf = Yi (/„,:) 

Y'r = Y2(/;3,:) 
else 

Y‘r = [Yi(/„,:), W,(J„,:)diag(y,)] 
Y 2 °^ = [Y2(//3,:), W.(J/3,:)diag(y,)] 

end if 

= svd(Y'°‘',r). 

[i^/3,y/3,H = svd(Y2°‘',r). 

■) = Ua 
0 = Up 

r uiu^j^,-.) - 

[uiUriJp,-.) _■ 

Delete Ut. 

end loop 


Apply A* to build the samples 
for the outgoing basis matrices. 

Zi = A*n2 - (A(^))*n2 
Z2 = A^fii - (A^^yni 

Take local SVDs to build incoming basis 
matrices and sibling interaction matrices. 
loop over boxes r on level £ 

Let {a, /?} denote the children of box r. 
if (r is the root) 

Zf = Zi(4, : ) 

Z^- = Z2{Ip, : ) 
else 

Z'r = [Zi(/a, : ), Vr(Ja,:)diag(z,)] 
Z^°^ = [Z2(J;3, : ), V.(J/ 3 ,:)diag(z,)] 

end if 

[Va,b 2 i,Xi] = svd(Z^°‘',r). 

[V;3,bl2,X2] = svd(Z^°^r). 

Za = b21 

Z/3 = bi2 

Ba/3 = Xi(l : r, :)diag(bi 2 ) 

B/ 3 a = X 2 (l : r, :)diag(b 2 i) 

_ r viVriJo.,:) ■ 

Delete Vr- 

end loop 
end loop 

Extract the diagonal matrices, 
nmax = max {n,- : r is a leaf} 

Q. = zeros (A^,nmax) 

loop over leaf boxes t 

^{Ir, 1 : tT-r) = eye(nr). 

end loop 

Y = Afi-A^^^n 
loop over leaf boxes t 
D, = Y(/,,l :n,). 

end loop 


Figure 6. A storage efficient algorithm for compressing an HBS matrix. 


Combining ( [T^ , ( [TT] ), and (1 81 , we find that 


B 


skel 

a,/3 




Once all leaves have been processed, we next proceed to the parent nodes. We do this by transversing the tree 
in the other direction, going from smaller to larger boxes. When a box r is processed, its children {a, /3} have 
already been processed. The key observation is now that that if we set = I™ U I™ and = 1°'^^ U 
then these index vectors form skeletons for r. These skeletons are inefficient, but by simply compressing the 
corresponding rows of 3 ^t- and Zr, we can build the skeletons and the interpolation matrices associated with 
r. Due to the self-similarity between levels in the HBS representation, the compression is entirely analogous 
to the compression of a leaf, with the only difference that the role played by the basis matrices Ur and Vt for 
a leaf, are now played by the sub-sampled matrices Utmp and Vtmp which represent the restriction of Ur and 









Execute the compression algorithm described in Figure 
loop over leaves r 

Ytnip — 

^tmp V^clisg(z^) 

[Tin,</in] = id(Yjjjjp,e) 

[Tout)<ZDut] = i‘i(^tmp)^) 

Usamp^U,(Ji„(l:fci„),:). 

Vsamp^V,(Jo„t(l:fcout),:). 

end loop 


u 


tmp 


V 


tmp 


u. 


V. 


loop over levels (. = [L — \) : (—1) : 1 
loop over boxes r on level ^ 

Let { a , /3} denote the children of box t. 

ysamp Q 

“o 

ysamp Q 

“o V““P 

Ytmp — titmpdi 6.'5(y^) 

Ztmp — Ytmpdicig(Z-7-) 

[Tin, •/in] = id(Yjjjjp,e) 

[Tontj^Zaut] = 

U;"“P = Utmp(Ji„(l:fci„),:). 

Vf“P = Vtmp(Jout(l:fcout),:). 

^(ysamp).^ 

end loop 
end loop 

Let {a, /3} denote the children of the root. 

gsM ^ ys^ampB^^^(y^amp)*^ 
gsM ^ ys^ampg^^^^ys^amp)*^ 


gskel ^ ysampg 

Q:,p C 

gskel _ y samp I 


Figure 7. An algorithm for computing the HBS-ID representation of a given matrix A. This 
algorithm adaptively determines the ranks of the off-diagonal blocks of A. 


Vr to the index rows and columns indicated by the index vectors I™ and 7°^*, respectively. The entire process 
is summarized in Figure |7] 


4.4. Asymptotic complexity. The asymptotic complexity for the HBSID algorithm is very similar to that for 
the HODLR algorithm. The key difference is that since A*^^^ is now applied using nested basis, we find 


N 


£-1 


~ Tflop X 2^ fe ^ 2^' ~ Tflop xkN. 


j=0 


In other words, the asymptotic complexity of applying r^(^) is now less by a factor of 0{£). In consequence. 


Tcompress ~ Tumult ^ A: log + Tflop X k'^ N log N. 


(19) 


5. Numerical experiments 


In this section, we present results from numerical experiments that substantiate claims on asymptotic complex¬ 
ity made in sections [3T] and |4~4| and demonstrate that the practical execution time is very competitive (in other 
words, that the scaling constants suppressed in the asymptotic analysis are moderate). We investigate four dif¬ 
ferent test problems: In Section [5T] we apply the randomized compression schemes to a discretized boundary 
integral operator for which other compression techniques are already available. This allows us to benchmark 
the new algorithms and verify their accuracy. In Section [5T] we demonstrate how the proposed scheme can be 
used to form a compressed representation of a product of two compressed matrices, thus demonstrated a way of 
circumventing the need for a complex and time-consuming structured matrix-matrix multiplication. In Section 


the “black-box” matrix-vector multiplication scheme (note that the data sparse format implicit in the FMM is 
much more cumbersome to invert than the HODLR and HBSID formats). In Section [5^ we apply the scheme 
to compress large dense matrices that arise in the classical “nested dissection” or “multifrontal” direct solvers 


5.3 we apply the schemes to a potential evaluation problem where we use the Fast Multipole Method (FMM) as 







for the sparse matrices arising from finite element or finite difference discretization of elliptic PDEs. The ability 
to efficiently manipulate such matrices allows for the construction of 0{N) complexity direct solvers for the 
associated sparse linear systems. 

For each of the four test problems, we compare two different techniques for computing a data-sparse repre¬ 
sentation of A: (1) The randomized technique for computing an HODLR-representation in Figure]^ (2) The 
randomized technique for computing an HBSID-representation in Figure]^ For each technique, we report the 
following quantities: 

N The number of DOFs (so that A is of size N x N). 

i The number of random vectors used at each level (i must be larger than the maximal rank). 

k The largest rank encountered in the compression. 

.^matvec The number of applications of A required (so that A^matvec = (T + 1) x £ ~ log(A/^) x £). 
Tcompress The time required for compression (in seconds). 

Tnet The time required for compression, excluding time to apply A and A* (in seconds). 

Tapp The time required for applying the compressed matrix to a vector (in seconds). 

M The amount of memory required to store A (in MB). 

The reason that we report the time Tnet (that does not count time spent in the black-box matrix-vector multiplier) 
is to validate our claims ( [T3] ) and ( [T^ regarding the asymptotic complexity of the method. To summarize, our 
predictions are, for the HODFR algorithm 

Tnet ~ N {logNf, Tapp ^N\ogN, M^N \og{N), 

and for the HBSID algorithm 

Tnet ~ N log N, Tapp ~ iV, M N. 

In addition to the timings, we computed a randomized estimate E of the compression error, computed as 
follows: We drew ten vectors of unit length from a uniform distribution on the unit sphere in 

Then E is defined via 

7-1 11 ^compressed^i 11 

E = max -r—-—-. 


5.1. Compressing a Boundary Integral Equation. Our first numerical example concerns compression of a 
discretized version of the Boundary Integral Equation (BIE) 


+ 


{x-y) -niy) 


q{y)ds{y) = f{x) 


£c E r, 


( 20 ) 


47r|£C — 

where T is the simple curve shown in Figure]^ and where n{y) is the outwards pointing unit normal of T at 
y. The BIE (201 is a standard integral equation formulation of the Faplace equation with boundary condition / 
on the domain interior to r|^ We discretize the BIE (201 using the Nystrdm method on N equispaced points on 
r, with the Trapezoidal rule as the quadrature. Note that the kernel in (20 1 is smooth, so the Trapezoidal rule 
has exponential convergence. This problem is slightly artificial in fhaf only abouf 200 poinfs are needed fo 
affain full double precision accuracy in fhe discrefizafion. We include if for bench-marking purposes fo verify 
fhe scaling of fhe proposed mefhod. 


To be precise, fhe mafrix A used in fhis numerical experimenf is ifself an HBS represenfafion of fhe mafrix 
resulting from discrefizafion of ( [20| ), compufed using fhe technique of ESI . To minimize the risk of spurious 
effects due to both the “exact” and the computed A being HBS representations, we used a much higher precision 
in computing the “exact” A, and also a shifted tree structure to avoid having the compressed blocks of our 
reference A align with the compressed blocks constructed by the randomzed sampling algorithms. 


1 Verify! 
^Check. 








Figure 8 . Contour F on which the BIE ( |20l ) in Section [5TT] is defined. 


N 

matvec 

T 

compress 

(sec) 

Tnet 

(sec) 

T 

-tapp 

(sec) 

M 

(MB) 

M/N 

(reals) 

E 

k 

400 

3x35 

0.065 

0.035 

0.001 

0.5 

157.7 

4.68e-ll 

28 

800 

4x35 

0.107 

0.033 

0.001 

1.0 

169.9 

3.67e-ll 

28 

1600 

5x35 

0.267 

0.081 

0.003 

2.2 

179.1 

3.09e-ll 

28 

3200 

6x35 

0.617 

0.188 

0.006 

4.5 

186.2 

2.57e-ll 

28 

6400 

7x35 

1.458 

0.456 

0.011 

9.5 

194.0 

1.99e-ll 

29 

12800 

8x35 

3.453 

1.198 

0.024 

19.6 

200.5 

1.80e-ll 

30 

25600 

9x35 

8.065 

2.998 

0.047 

39.7 

203.1 

1.74e-ll 

29 

51200 

10x35 

18.362 

7.140 

0.099 

82.1 

210.3 

1.14e-ll 

30 

102400 

11 x35 

41.985 

17.366 

0.204 

166.6 

213.2 

1.24e-ll 

30 


Table 1. Compression to th e HO DLR format using Algorithm|^of the double layer integral 


equation described in Section 5.1 Here e = 10 ^ and I = 35. 


N 

^matvec 

T 

^ compress 

(sec) 

T-aet 

(sec) 

T 

-tapp 

(sec) 

M 

(MB) 

M/N 

(reals) 

E 

k 

400 

3x35 

0.098 

0.068 

0.001 

0.4 

127.7 

1.40e-09 

25 

800 

4x35 

0.163 

0.090 

0.002 

0.7 

113.6 

1.30e-09 

25 

1600 

5x35 

0.381 

0.193 

0.005 

1.3 

104.3 

1.42e-09 

24 

3200 

6x35 

0.840 

0.411 

0.009 

2.4 

98.3 

1.32e-09 

24 

6400 

7x35 

1.881 

0.878 

0.019 

4.6 

94.2 

1.56e-09 

24 

12800 

8x35 

4.258 

1.969 

0.037 

8.9 

91.3 

2.04e-09 

23 

25600 

9x35 

9.262 

4.205 

0.076 

17.6 

90.3 

1.62e-09 

23 

51200 

10x35 

20.431 

9.238 

0.153 

34.4 

87.9 

1.44e-09 

23 

102400 

11 x35 

45.732 

21.039 

0.310 

68.3 

87.5 

1.61e-09 

23 


Table 2. Compression to th e HB SID format using Algorithmof the double layer integral 


equation described in Section |5.1| Here e = 10 ® and i = 35. 


For this experiment, we bench-mark the new compression algorithms by comparing their speed, accuracy, and 
memory requirements to the compression technique based on potential theory described in |[28l . run at the 
same precision as the randomized compression scheme. The results are presented in tables [T] and and 
summarized in Figure]^ 

5.2. Operator multiplication. We next apply the proposed scheme to compute the Neumann-to-Dirichlet 
operator for the boundary value problem 


-Au(x) = 0 
dnu(x) = g(x) 


X G fl, 
X gF, 


(21) 

























N 

T 

^ compress 

(sec) 

^app 

(sec) 

M 

(MB) 

M/N 

(reals) 

E 

k 

400 

0.053 

0.001 

0.4 

128.3 

1.51e-09 

38 

800 

0.104 

0.002 

0.7 

119.5 

2.57e-09 

37 

1600 

0.194 

0.005 

1.4 

113.6 

3.57e-09 

35 

3200 

0.352 

0.009 

2.7 

109.9 

4.58e-09 

34 

6400 

0.674 

0.019 

5.3 

108.2 

7.42e-09 

33 

12800 

1.368 

0.041 

10.5 

107.1 

2.02e-08 

31 

25600 

2.664 

0.079 

20.8 

106.7 

1.89e-08 

29 

51200 

5.308 

0.162 

41.5 

106.3 

2.51e-08 

28 

102400 

10.758 

0.327 

82.9 

106.1 

3.95e-08 

25 


Table 3. Co mpression to the HBSID format of the double layer integral equation described 
in Section 5.1 using the technique based on potential theory of ll28]l with e = 10“®. 


where T is again the contour shown in Figure where fl is the domain exterior to F, and where n is the unit 
normal vector pointing in the outwards direction from F. With u the solution of (21 1 , let / denote the restriction 
of u to F, and let T denote the linear operator T : g ^ f, known as the Neumann-to-Dirichlet (NtD) operator. 
It is weU-known (see Remark]^ that T can be built explicitly as the product 

T = + , (22) 


where S is the single-layer operator [iS'q](a;) = /p — 5^ log \x — y \ q{y) ds{y), and where D* is the adjoint of 
the double-layer operator [D*q]{x) = Jp ^ 2 w\x-y\^'^ ds{y). We form discrete approximations S and D* 

to S and D*, and compute (|l -f using the techniques in ifTSl . with 6**^ order Kapur-Rokhlin quadrature 

used to discretize the singular integral operator S. Then we can evaluate a discrete approximation T to T via 

T = S (l| + D-)-‘. 


For this example, we evaluated an additional error metric by testing the computed NtD operator to an exact 
solution Mexact tO The function rtexact is given as the potential from a collection of five randomly placed 
charges inside F, and then fexact and ggxact are simply the evaluations of rxexact and its normal derivative on the 
quadrature nodes on F. Then the new error measure is given by 

jji _ 11 fexact ~ Tapprox Sexact 11 max 

J-ypot — Ilf II ’ 

11 * exact 11 max 

where || • ||max is the maximum norm, and where Tapprox is the compressed representation of T determined by 
the randomized sampling scheme proposed. 

The numerical results are presented in tables]^ andand summarized in Figure [T0| 

Remark 6. The formula \22\ for the NtD operator is derived as follows: We first look for a solution to ( |2jp of 
the form u = Sq. Then it can be shown that q must satisfy {l/2)q + D*q = g. Solving for q and using f = Sq, 
we obtain ( |^ . 


5.3. Dimensional reduction in the Fast Multipole Method. In this section, we investigate a numerical ex¬ 
ample where A is a potential evaluation map for a set of electric charges in the plane. To be precise, we let 
denote a set of points located as shown in Figure[^ Then A is the N x N matrix with entries 

-T^loglaji - Xj\, 


A(i,i) 


0 


when i / j, 
when i = j. 
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Figure 9. Visualization of the results presented in Tables [T][^ and[^ pertaining to the exam¬ 
ple in Section [5T] 


We apply A rapidly using the classical Fast Multipole Method ifTTl with 30th order expansions, whieh ensures 
that the error in the black-box code is far smaller than our requested precision of e = 10“^. Our FMM is 
implemented in Matlab as described in lIZTll . This implementation is quite inefficient, but has the ability to 
apply A to an W X £ matrix rather than single vector. 

For this example, we use an additional error measure Spot that compares the compressed matrix Acompress to 
the exact matrix A, evaluated at a subset of the target points. To be preeise, we picked at random a subset of 
10 target loeations, marked by the index vector I C {1, 2, N}. We then drew a sequence of ten veetors 

iri which each entry is drawn at random from a uniform distribution on the interval [0,1] (observe that 


















































N 

^matvec 

T 

.L compress 

(sec) 

Tnet 

(sec) 

T 

(sec) 

M 

(MB) 

M/N 

(reals) 

E 

Epot 

k 

400 

2x60 

0.069 

0.030 

0.001 

0.6 

200.3 

2.06e-ll 

3.87e-07 

32 

800 

3x60 

0.123 

0.034 

0.001 

1.5 

237.7 

4.35e-ll 

8.22e-09 

35 

1600 

4x60 

0.323 

0.088 

0.002 

3.5 

283.1 

4.00e-ll 

3.06e-09 

40 

3200 

5x60 

0.839 

0.236 

0.005 

8.0 

327.7 

7.18e-ll 

9.74e-09 

44 

6400 

6x60 

2.093 

0.657 

0.011 

18.1 

370.0 

1.08e-10 

8.37e-09 

45 

12800 

7x60 

5.108 

1.734 

0.024 

40.7 

417.0 

1.61e-10 

1.76e-08 

49 

25600 

8x60 

12.302 

4.516 

0.055 

89.3 

457.2 

3.09e-10 

3.41e-08 

52 

51200 

9x60 

29.380 

11.636 

0.120 

195.8 

501.2 

5.44e-10 

7.22e-08 

54 

102400 

10x60 

67.879 

29.072 

0.284 

426.0 

545.3 

l.lle-09 

3.39e-07 

55 


Table 4. Comp ressi on to the HODLR format using Algorithm of the NtD operator de¬ 
scribed in SectionHere e = 10“® and i = 60. 


N 

^matvec 

T 

.r compress 

(sec) 

Tnet 

(sec) 

T 

J.app 

(sec) 

M 

(MB) 

M/N 

(reals) 

E 

-E'pot 

k 

400 

2x60 

0.093 

0.062 

0.001 

0.6 

211.7 

3.48e-10 

3.86e-07 

29 

800 

3x60 

0.190 

0.103 

0.001 

1.2 

198.7 

2.70e-10 

8.25e-09 

31 

1600 

4x60 

0.482 

0.245 

0.003 

2.4 

192.7 

5.32e-10 

5.42e-09 

33 

3200 

5x60 

1.154 

0.557 

0.006 

4.6 

189.9 

1.57e-09 

l.Ole-08 

36 

6400 

6x60 

2.708 

1.269 

0.012 

9.1 

186.5 

2.02e-09 

1.35e-08 

37 

12800 

7x60 

6.310 

2.906 

0.023 

18.1 

185.4 

3.98e-09 

2.51e-08 

40 

25600 

8x60 

14.495 

6.511 

0.048 

35.6 

182.3 

6.80e-09 

4.72e-08 

41 

51200 

9x60 

32.906 

15.033 

0.094 

70.8 

181.4 

1.47e-08 

8.04e-08 

42 

102400 

10x60 

82.238 

37.772 

0.189 

139.8 

178.9 

2.51e-08 

3.46e-07 

44 


Table 5. Comp ressi on to the HODLR format using Algorithm of the NtD operator de¬ 
scribed in SectionHere e = 10“® and £ = 60. 


every charge is positive). Then E'pot is defined via 


T^pot = max 


|A(/, '.)u)i A(.Qjjip].essed(.^) 0^* 


i<i<io ||A(/, :)cjj|| 

The numerical results are presented in tables and [7| and summarized in Figure [T^ 


5.4. Compression of frontal matrices in nested dissection. Our final example applies the proposed compres¬ 
sion schemes to the problem of constructing 0{N) direct solvers for the sparse linear systems arising upon the 
discretization of elliptic PDFs via finite difference or finite element methods. The idea is to build a solver on the 
classical “nested dissection” scheme of George ifT^ [TTl [TOl . In standard implementations, the problem of this 
direct solver is that it requires the inversion or LU-factorization of a set of successively larger dense matrices. 
However, it has recently been demonstrated that while these matrices are dense, they have internal structure that 
allows for linear or close to linear time matrix algebra to be executed, ll24ll^l^[T4l . In this manuscript, we 
test the proposed compression scheme on a set of matrices whose behavior is directly analogous to the matrices 
encountered in the algorithms of ll24l |34l |30l O. To be precise, let B denote the sparse coefficient matrix 
associated with a grid conduction problem on the grid shown in Figure Each bar has a conductivity that is 
drawn at random from a uniform distribution on the interval [1,2]. Let Ii,l 2 , h denote three index vectors that 
mark the three regions shown in Figure [T^ set 


hj = 1,2,3 































Time for matrix compression (seconds) 


Time for matrix-vector multiplication (seconds) 
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Figure 10. Visualization of the results presented in Tables]^ and|^ pertaining to the example 
in Section l5!2l 


and then define the N x N matrix A via 

A = B 33 — B 3 iB;^/Bi 3 — B32B^2^B23. (23) 

The relevance of the matrix A is discussed in some detail in Remark|7j 


In our numerical experiments, the black-box application of A, as defined by (231 was executed using the 
sparse matrix built-in routines in Matlab, which relies on UMFPACK ||9l for the sparse solves implicit in 


the application of and B 


22 


The numerical results are presented in tables and and summarized in Figure [T4| 

Remark 7. To illustrate the connection between the matrix A, as defined by ( |23p , and the LU-factorization of 
a matrix such as B, observe first that the blocks B 12 and B 21 are zero, so that (up to a permutation of the rows 
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Figure 11. Geometry of the potential evaluation map in Section 5.3 The top figure shows 
the entire geometry, and the lower figure shows a magnification of a small parf. 


N 

^matvec 

T 

-L compress 

(sec) 

Tnet 

(sec) 

T 

-tapp 

(sec) 

M 

(MB) 

M/N 

(reals) 

E 

Epot 

k 

400 

3x45 

0.101 

0.021 

0.001 

0.5 

173.9 

1.57e-10 

1.16e-10 

21 

800 

4x45 

0.265 

0.055 

0.001 

1.3 

215.9 

3.69e-10 

3.52e-10 

23 

1600 

5 X 45 

0.812 

0.133 

0.003 

3.1 

257.8 

1.09e-10 

1.46e-10 

24 

3200 

6x45 

1.925 

0.321 

0.007 

7.4 

304.9 

6.17e-ll 

8.77e-ll 

26 

6400 

7x45 

4.509 

0.783 

0.015 

17.2 

351.5 

4.22e-ll 

5.61e-ll 

28 

12800 

8x45 

10.980 

1.941 

0.032 

39.1 

400.6 

3.11e-ll 

3.63e-ll 

30 

25600 

9x45 

26.271 

4.788 

0.068 

87.1 

446.0 

2.56e-ll 

3.38e-ll 

30 

51200 

10x45 

62.119 

11.938 

0.149 

192.5 

492.7 

2.26e-ll 

2.67e-ll 

32 

102400 

11 x45 

158.617 

30.621 

0.362 

419.5 

537.0 

1.96e-ll 

2.19e-ll 

32 


Table 6 . Compression fo the HODLR format using Algorithm]^ of the potential evaluation 
matrix described in Section 1531 Here e = 10“® and i = 45. 


N 

^matvec 

T 

-r compress 

(sec) 

Tnet 

(sec) 

T 

-tapp 

(sec) 

M 

(MB) 

M/N 

(reals) 

E 

-E'pot 

k 

400 

3x45 

0.140 

0.062 

0.001 

0.5 

164.1 

3.84e-09 

2.96e-09 

31 

800 

4x45 

0.358 

0.149 

0.002 

1.0 

164.8 

2.29e-08 

2.10e-08 

33 

1600 

5x45 

1.007 

0.329 

0.005 

2.0 

166.1 

1.12e-08 

1.28e-08 

37 

3200 

6x45 

2.306 

0.713 

0.010 

4.1 

166.5 

1.29e-08 

1.08e-08 

39 

6400 

7x45 

5.334 

1.554 

0.020 

8.0 

164.7 

8.47e-09 

5.13e-09 

40 

12800 

8x45 

12.366 

3.350 

0.040 

15.8 

161.7 

7.75e-09 

4.66e-09 

42 

25600 

9x45 

28.813 

7.170 

0.081 

30.9 

158.1 

7.28e-09 

7.56e-09 

43 

51200 

10x45 

64.838 

15.572 

0.166 

60.1 

154.0 

1.47e-08 

9.91e-09 

44 

102400 

11 x45 

164.750 

36.686 

0.335 

116.7 

149.4 

1.23e-08 

1.03e-08 

43 


Table 7. Compression to the HBSID format using Algorithm [7] of the potential evaluation 
matrix described in Section 133 Here e = 10“® and I = 45. 
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Figure 12. Visualization of the results presented in Tables]^ and|^ pertaining to the example 
in Section 1531 


Next suppose that we can somehow determine the LU-factorizations o/Bn and B 22 , 


Bii = LiiUii, and B 22 = ^ 22 ^ 22 - 
Then the LU-factorization of B is given by 



Ln 

0 

0 


Uii 

0 

Ln Bi3 ■ 

B = 

0 

L 22 

0 


0 

U 22 

L22^B23 


1 

DO 

OJ 

B 32 U 22 ' 

1-33 


0 

0 

U 33 


where L 33 and U 33 are defined as the LU factors of 

L 33 U 33 = B 33 - B3iU]^/Lj“/Bi3 - B32U22^L22^B23 • 

'-V-' 

=:A 


(24) 
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Figure 13. Geometry of problem described in Section 5.4 We consider a grid conduction 
problem on the grid shown. As N is increased, the width or the grid is fixed at 41 nodes, while 
the height of the grid equals N. 
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^matvec 

T 

compress 

(sec) 

Tnet 

(sec) 

T 

Xa,pp 

(sec) 

M 

(MB) 

M/N 

(reals) 

E 

k 

400 

3x25 

0.348 

0.012 

0.001 

0.4 

115.0 

1.83e-14 

9 

800 

4x25 

0.849 

0.025 

0.001 

0.8 

133.4 

1.47e-14 

9 

1600 

5x25 

2.143 

0.064 

0.002 

1.9 

151.6 

1.40e-14 

9 

3200 

6x25 

5.588 

0.141 

0.005 

4.1 

169.7 

1.43e-14 

9 

6400 

7x25 

13.926 

0.336 

0.011 

9.2 

187.8 

1.37e-14 

9 

12800 

8x25 

38.558 

0.903 

0.023 

20.1 

205.8 

1.36e-14 

9 

25600 

9x25 

91.382 

2.322 

0.048 

43.7 

223.8 

1.29e-14 

9 

51200 

10x25 

208.704 

5.874 

0.098 

94.5 

241.8 

1.32e-14 

9 

102400 

11 x25 

468.981 

14.203 

0.204 

203.0 

259.8 

1.30e-14 

9 


Table 8. Compression to the HODLR format using Algorithm ^of a simulated “frontal ma¬ 
trix” in the nested dissection technique, as described in Section 5.4 Here e = 10“® and = 25. 


N 

^matvec 

T 

compress 

(sec) 

Tnet 

(sec) 

T 

J-app 

(sec) 

M 

(MB) 

M/N 

(reals) 

E 

k 

400 

3x25 

0.361 

0.032 

0.001 

0.4 

121.0 

3.80e-14 

18 

800 

4x25 

0.867 

0.066 

0.002 

0.8 

124.9 

4.75e-14 

18 

1600 

5 x25 

2.180 

0.150 

0.005 

1.6 

127.7 

4.34e-14 

18 

3200 

6x25 

5.575 

0.322 

0.010 

3.2 

129.5 

4.27e-14 

18 

6400 

7x25 

14.470 

0.708 

0.019 

6.4 

130.6 

4.30e-14 

18 

12800 

8x25 

38.945 

1.525 

0.040 

12.8 

131.3 

4.19e-14 

18 

25600 

9x25 

91.563 

3.253 

0.081 

25.7 

131.7 

4.11e-14 

18 

51200 

10x25 

208.145 

7.440 

0.157 

51.5 

131.9 

4.10e-14 

18 

102400 

11 x25 

465.309 

15.759 

0.314 

103.1 

132.0 

4.07e-14 

18 


Table 9. Compression to the HBSID format using Algorithm 7] of a simulated “frontal ma¬ 
trix” in the nested dissection technique, as described in Section 5.4 Here e = 10“® and i = 25. 


Observe that since the matrices Ln, Du, L22, U22 till triangular, their inverses are inexpensive to apply. 
To summarize, if one can cheaply evaluate the LU factorization {24\, then the task of LU-factoring B directly 
reduces to the task of LU factoring the two matrices Bn and B 22 , which both involve about half as many 
variables as B. The classical nested dissection idea is now to apply this observation recursively to factor Bn 
and B22- The problem of this scheme has traditionally been that in order to evaluate L33 and U33 in ( | 24 | ), one 
must factorize the dense matrix A. 
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Figure 14. Visualization of the results presented in Tablesj^ and|^ pertaining to the example 
in Section [531 


5.5. Summary of observations from numerical experiments. To close this section, we make some observa¬ 
tions and conjectures: 

• In all examples examined, numerical evidence supports the claims on asymptotic scaling made in Sec¬ 
tions 123] and HU 

• Excellent approximation errors are obtained in every case. Aggregation of errors over levels is a very 
minor problem. 

• The computational time is in all cases dominated by the time spent in the external black-box multiplier. 
As a consequence, the primary route by which the proposed algorithm could be improved would be to 
lessen the number of matrix-vector multiplications required. 





















































Build outgoing expansions on level m. 
loop over all nodes r on level m 
q, = V;q(/.) 

end loop 


Build outgoing expansions on levels coarser than m (upwards pass). 
loop over levels i = (m — 1) : (—1) : 1 
loop over boxes r on level i 

denote the children of r. 


Let {a, fd 

a. = v: 


end loop 
end loop 


aa 

a/j J 


Build incoming expansions for the children of the root. 

Let {a, /?} denote the children of the root node, 
a® a/3- 

U/3 = B/3,a aa- 

Build incoming expansions on levels coarser than m (downwards pass). 
loop over levels i = (m — 1) -. (—1) : 1 
loop over boxes r on level i 

Let {a, /3} denote the children of r. 

Uq — Bq;^/ 3 q^ Ut-(o/q,, .) Ut. 

ao “h .) Ut- 

end loop 
end loop 


Build incoming expansions on level m. 
loop over boxes r on level m 
U{lr) = IAt Ur 

end loop 


Figure 15. Application of in the HBS framework. Given a vector q of charges, build 
the vector u = q of potentials. 


• For modest problem sizes, the HODLR algorithm is very fast and easy to use. However, as problems 
grow large, the memory requirements of the HODLR format become slightly problematic, and the 
HBSID algorithm gains a more pronounced advantage. 


References 

[1] S. Ambikasaran and E. Darve, An o{n\ogn) fast direct solver for partial hierarchically semi-separable matrices. Journal 
of Scientific Computing, 57 (2013), pp. 477-501. 

[2] M. Bebendorf, Approximation of boundary element matrices, Numerische Mathematik, 86 (2000), pp. 565-589. 

[3] -, Hierarchical matrices, vol. 63 of Lecture Notes in Computational Science and Engineering, Springer-Verlag, Berlin, 2008. 

A means to efficiently solve elliptic boundary value problems. 

[4] S. Borm, Efficient numerical methods for non-local operators, vol. 14 of EMS Tracts in Mathematics, European Mathematical 
Society (EMS), Zurich, 2010. Tf^-matrix compression, algorithms and analysis. 

[5] S. Borm and L. Grasedyck, Hybrid cross approximation of integral operators, Numerische Mathematik, 101 (2005), pp. 221- 
249. 

[6] T. E. Chan, Rank revealing qrfactorizations. Linear Algebra and its Applications, 88-89 (1987), pp. 67 - 82. 






[7] S. Chandrasekaran, M. Gu, and W. Lyons, A/asf adaptive solver for hierarchically semiseparable representations, Cal- 
colo, 42 (2005), pp. 171-185. 

[8] H. Cheng, Z. Gimbutas, P. Martinsson, and V. Rokhlin, On the compression of low rank matrices, SIAM Journal of 
Scientific Computing, 26 (2005), pp. 1389-1404. 

[9] T. A. Davis, Algorithm 832: Umfpack v4. 3—an unsymmetric-pattern multifrontal method, ACM Transactions on Mathematical 
Software (TOMS), 30 (2004), pp. 196-199. 

[10] -, Direct methods for sparse linear systems, vol. 2, Siam, 2006. 

[11] 1. Dupe, A. Erisman, and J. Reid, Direct Methods for Sparse Matrices, Oxford, 1989. 

[12] K. Frederix and M. V. Barel, Solving a large dense linear system by adaptive cross approximation. Journal of Computational 
and Applied Mathematics, 234 (2010), pp. 3181 - 3195. 

[13] A. George, Nested dissection of a regular finite element mesh, SIAM J. on Numerical Analysis, 10 (1973), pp. 345-363. 

[14] A. Gillman and P. Martinsson, A direct solver with o{n) complexity for variable coefficient elliptic pdes discretized via a 
high-order composite spectral collocation method, SIAM Journal on Scientific Computing, 36 (2014), pp. A2023-A2046. 

[15] A. Gillman, P. Young, and P.-G. Martinsson, A direct solver o{n) complexity for integral equations on one-dimensional 
domains. Frontiers of Mathematics in China, 7 (2012), pp. 217-247. 10.1007/sl 1464-012-0188-3. 

[16] G. H. Golub and C. F. Van Loan, Matrix computations, Johns Hopkins Studies in the Mathematical Sciences, Johns Hopkins 
University Press, Baltimore, MD, third ed., 1996. 

[17] L. Greengard and V. Kommian, A fast algorithm for particle simulations, J. Comput. Phys., 73 (1987), pp. 325-348. 

[18] M. Gu AND S. C. Eisenstat, Efficient algorithms for computing a strong rank-revealing QR factorization, SIAM J. Sci. Corn- 
put., 17 (1996), pp. 848-869. 

[19] W. Hackbusch, a sparse matrix arithmetic based on H-matrices; Part 1: Introduction to H-matrices, Computing, 62 (1999), 
pp. 89-108. 

[20] W. Hackbusch and S. Borm, Data-sparse approximation by adaptive I-L^-matrices, Computing, 69 (2002), pp. 1-35. 

[21] W. Hackbusch, B. Khoromskij, and S. Sauter, On T-L^-matrices, in Lectures on Applied Mathematics, Springer Berlin, 
2002, pp. 9-29. 

[22] N. Halko, P.-G. Martinsson, and J. A. Tropp, Finding structure with randomness: Probabilistic algorithms for construct¬ 
ing approximate matrix decompositions, SIAM Review, 53 (2011), pp. 217-288. 

[23] D. Huybrechs, Multiscale and hybrid methods for the solution of oscillatory integral equations, PhD thesis, Katholieke Uni- 
versiteit Leuven, 2006. 

[24] S. Le Borne, L. Grasedyck, and R. Kriemann, Domain-decomposition based H-LUpreconditioners, in Domain decom¬ 
position methods in science and engineering XVl, vol. 55 of Lect. Notes Comput. Sci. Eng., Springer, Berlin, 2007, pp. 667-674. 

[25] L. Lin, j. Lu, and L. Ying, Fast construction of hierarchical matrix representation from matrixvector multiplication. Journal 
of Computational Physics, 230 (2011), pp. 4071 - 4087. 

[26] P. Martinsson, Rapid factorization of structured matrices via randomized sampling, 2008. arXiv:0806.2339. 

[27] -, A fast randomized algorithm for computing a hierarchically semiseparable representation of a matrix, SIAM Journal on 

Matrix Analysis and Applications, 32 (2011), pp. 1251-1274. 

[28] P. Martinsson and V. Rokhlin, A fast direct solver for boundary integral equations in two dimensions, J. Comp. Phys., 205 
(2005), pp. 1-23. 

[29] -, An accelerated kernel independent fast multipole method in one dimension, SIAM Journal of Scientific Computing, 29 

(2007), pp. 1160-11178. 

[30] P.-G. Martinsson, A fast direct solver for a class of elliptic partial differential equations, J. Sci. Comput., 38 (2009), pp. 316- 
330. 

[31] P.-G. Martinsson, V. Rokhlin, and M. Tygert, A randomized algorithm for the approximation of matrices. Tech. Report 
Yale CS research report YALEU/DCS/RR-1361, Yale University, Computer Science Department, 2006. 

[32] -, A randomized algorithm for the decomposition of matrices, Appl. Comput. Harmon. Anal., 30 (2011), pp. 47-68. 

[33] J. XiA, Randomized sparse direct solvers, SIAM Journal on Matrix Analysis and Applications, 34 (2013), pp. 191-221. 

[34] J. XiA, S. Chandrasekaran, M. Gu, and X. S. Ll, Superfast multifrontal method for large structured linear systems of 
equations, SIAM J. Matrix Anal. Appl., 31 (2009), pp. 1382-1411. 

[35] -, Fast algorithms for hierarchically semiseparable matrices. Numerical Linear Algebra with Applications, 17 (2010), 

pp. 953-976. 



